# load packages
import numpy as np
import pandas as pd
import pprint as pp
import matplotlib.pyplot as plt
%matplotlib inline
# to use seaborn styling in matplotlib plots
import seaborn as sns
sns.set()
# to display offline interactive plots
import plotly as py
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
# cufflines binds plotly to pandas DFs
import cufflinks as cf
import plotly.tools as tls
tls.embed('https://plot.ly/~cufflinks/8')
cf.set_config_file(offline=True, world_readable=False, theme='ggplot')
Science has established a causal link to the dramatic spike in greenhouse gas emissions since the industrial revolution, the most significant of which being carbon dioxide, or CO2.
I'm not here to argue established fact: this is one case where correlation does indeed imply causation.
The two plots below speak for themselves. While the scale is different (temperature variations in ° Celsius vs. 1000 metric tonnes), the curve is nearly identical.
# load historical temperature variation data
url = 'https://climate.nasa.gov/system/internal_resources/details/original/647_Global_Temperature_Data_File.txt'
temps = pd.read_csv(url, sep= '\t', header=None)
temps.rename(columns={0:'year', 1:'min', 2:'max'}, inplace=True)
temps.set_index('year', inplace=True)
temps.plot()
plt.suptitle('Max & min variations from avg temperature for 1951-80')
plt.title('in degrees Celsius', fontsize='x-small');
From https://climate.nasa.gov/vital-signs/global-temperature/ :
"This graph illustrates the change in global surface temperature relative to 1951-1980 average temperatures. Seventeen of the 18 warmest years in the 136-year record all have occurred since 2001, with the exception of 1998. The year 2016 ranks as the warmest on record. (Source: NASA/GISS)." (Emphasis mine)
# Compare with global CO2 total *annual* (not cumulative!) emissions
# Global state of CO2
url = 'http://cdiac.ess-dive.lbl.gov/ftp/ndp030/CSV-FILES/global.1751_2014.csv'
df = pd.read_csv(url)\
.drop(0).iloc[:,0:2]\
.astype(int).set_index('Year')
# plot same for same timeframe as temp data
df.loc[1880:].plot(legend=False);
plt.suptitle('Annual CO2 emissions keep growing faster')
plt.title('in 1000 metric tonnes', fontsize='x-small');
Asking questions is more motivating
This is a depressing and scary subject. I've found lately that I've been avoiding news about climate change, and I've come across so many versions of the plots above that they no longer even register for me.
So my hope is by asking my own (basic) questions of the data, instead of passively consuming the answers to others', I'll find renewed motivated to reduce my own CO2_pc.
My questions
Not being a climate scientist, just a curious person, these are simple:
CO2_pc emitters?CO2_pc increasing and decreasing the most?CO2_pc? What is happening in the decreasing-emissions countries?While the previous plot showed global annual CO2 emissions world wide, I will start by analyzing per capita CO2 emissions by country and year.
Later on I'll introduce additional data columns.
Note about data wrangling: the csv file used here is the result of merging multiple data sources. It required an INSANE amount of data wrangling. So much so that to include it all here would disrupt the flow of the analysis. Instead, I've documented the process in another notebook, make_df_nov18.ipynb.
## Load per capita CO2 data
mypath = 'data/'
filename = 'final_df.csv'
df = pd.read_csv(mypath + filename)
# assign name attribute to df
df.name = filename.split('_')[0]
df.head()
# for now, just focus on annual CO2 & regions columns
val = 'CO2_pc'
co2_df = df.iloc[:,:5]
co2_df.info()
The following variables (columns) were downloaded from Gapminder.
Pre-1950, based on historical calculations that tabulate coal, brown coal, peat, and crude oil production by nation and year. 1950 onwards, Fossil-Fuel Burning (liquid, gas & solid), Cement Manufacture, and Gas Flaring. http://cdiac.ess-dive.lbl.gov/trends/emis/meth_reg.html
CO2_pc: Average CO2 emission in metric tons per person during the given year, calculated from converting C to CO2 (C * 3.67), then dividing the total population of the countries.
country: country reporting CO2_pc value.
year: year in which the value was reported.
The following variables I found on github, and merged (painfully) with the CO2 dataset.
These lists are the result of merging data from two sources, the Wikipedia ISO 3166-1 article for alpha and numeric country codes, and the UN Statistics site for countries' regional, and sub-regional codes. In addition to countries, it includes dependent territories. https://github.com/lukes/ISO-3166-Countries-with-Regional-Codes
region: Continent the country belongs to.
sub-region: Lower-level grouping to organizing countries. Membership is strictly based on geography, not cultural affiliations: for example, Mexico is not in North America, but Central America.
CO2_pc¶# CO2 dist since 1950
q = "year >= 1950"
co2_df.query(q)[val].hist(bins=50)
plt.title('Distribution of %s for %s' % (val, q));
# log transform this long-tailed data
co2_df.query(q)[val].apply(np.log)\
.replace([np.inf, -np.inf], np.nan).hist(bins=10);
# boxplot without outliers
co2_df.query(q)[val].plot(kind='box', showfliers = False);
# boxplot with outliers
# set style for showing outliers
flierprops = dict(marker='o', markersize=5, alpha=0.2)
co2_df.query(q)[val].plot(kind='box', flierprops = flierprops);
There are some extremely high outliers going on here.
# stats
co2_df.query(q).drop('year', axis=1).describe().round(2)
# top 10 co2 per cap. emitters for most recent year
max_yr = co2_df['year'].max()
q = "year == @max_yr"
co2_df.query(q)[['country',val, 'sub-region']].sort_values(val, ascending=False).head(10)
# bottom 10 co2 per cap. emitters for most recent year
co2_df.query(q)[['country',val, 'sub-region']].sort_values(val, ascending=False).tail(10)
Top CO2 per capita emitters are either:
And while the big emitters can be found across the world, the lowest emitters are all in Africa.
CO2_pc by region¶# How many countries in each region?
co2_df.groupby('region')['country'].nunique().plot(kind='bar', rot=-45)
plt.title('Oceania has less than half the countries of other regions');
# country counts by region
co2_df.groupby(['region'])['country'].nunique().sort_values()
# dist by region
q = "year >= 1950"
co2_df.query(q).sort_values('region').set_index(['region', 'sub-region','country','year'])\
.unstack(level=(0))[val]\
.hist(bins=20, figsize=(6,6));
Interestingly, only Europe has a quasi-normal distribution. Everywhere else has extremely negatively-skewed distributions.
The longest tail appears to be in the Americas, to judge by the x-axis range.
# apply natural log transform on CO2_pc
log_co2 = co2_df.query(q).sort_values('region').set_index(['region', 'sub-region','country','year'])\
.unstack(level=0)[val].apply(np.log)
# replace numpy infinite values with nans
log_co2.replace([np.inf, -np.inf], np.nan)\
.hist(bins=10, figsize=(6,8));
The distribution of the log CO2_pc values becomes normal except for:
## helper function
def make_regional_boxplot(df, facet, showfliers):
'''facets dataframe and shows boxplot for
each facet, with title'''
df.boxplot(by=facet, showfliers = showfliers, flierprops = flierprops, rot=45)
plt.title(df['region'].unique())
plt.show();
# boxplots by region
q = "year >= 1950"
co2_df_box = co2_df.query(q).sort_values(['region', 'sub-region'])\
.set_index(['region', 'sub-region','country','year'])[val].reset_index(level=(0,1))
co2_df_box.boxplot(by='region', flierprops = flierprops);
# Focus on the boxes, hide outliers
co2_df_box.boxplot(by='region', showfliers = False);
# helper function
def highlight_max(s):
'''
highlight the maximum in a Series yellow.
'''
is_max = s == s.max()
return ['background-color: yellow' if v else '' for v in is_max]
## quantify regional variations
q = "year >= 1950"
region_stats = co2_df.query(q).groupby('region').describe()[val]
region_stats.style.apply(highlight_max)
How many are there, and where?
# get outlier threshold for each region (3rdQuartile + 1.5 * IQR)
region_stats['outlier_th'] = region_stats.apply(
lambda region: region['75%'] + 1.5 * (region['75%'] - region['25%']), axis=1)
yr = 1950
outlier_counts = []
outlier_countries = []
for region in co2_df['region'].unique():
outlier_th = region_stats['outlier_th'].loc[region]
q = "year >= @yr & region == @region"
reg_df = co2_df.query(q)
counts_df = reg_df.loc[reg_df[val] > outlier_th]\
.groupby(['region']).size()\
.sort_values()
countries_df = reg_df.loc[reg_df[val] > outlier_th].groupby(['region', 'country']).size()
outlier_counts.append(counts_df)
outlier_countries.append(countries_df)
outliers_by_reg = pd.concat(outlier_counts).sort_values()
outliers_by_country = pd.concat(outlier_countries)
outliers_by_reg
outliers_by_country # based on region IQR, not sub-region's (which is often higher)
I still get the impression that there is a disproportionate number of tiny countries among the outlier countries in the Americas (such as St Pierre et Miquelon and the Falklands) and Europe (Luxembourg, again), possibly Africa too.
It might be worthwile looking at population or surface area to see if there's a negative correlation with CO2_pc.
Note that while tiny countries are among Oceania's outliers, that's less surprising: aside from Australia, virtually all countries in this region are very small.
## helper functions
def assign_name(row):
df = row['df']
df.name = row['labels']
def facetize_df(df, facet):
"""
Splits df into list of smaller dfs for each categorical value
"""
facet_dfs = [df.loc[df[facet]==f] for f in df[facet].unique()]
facet_list = [f for f in df[facet].unique()]
facet_dfs = pd.DataFrame({'df': facet_dfs, 'labels': facet_list})
#facet_dfs = facet_dfs.apply(assign_name, axis=1)
return facet_dfs
# boxplots by region & sub-region
# split reshaped into smaller regional co2_df's
regions_co2_dfs = facetize_df(co2_df_box, 'region')
showfliers=True
regions_co2_dfs['df'].apply(make_regional_boxplot, args=['sub-region', showfliers])
# same boxplots without outliers
showfliers = False
regions_co2_dfs['df'].apply(make_regional_boxplot, args=['sub-region', showfliers])
While it's hard to tell which region has the most variation between sub-regions, Europe seems to have the least.
Americas and Oceania I suspect have the most, based on the differences in position & shape of the IQR boxes for their respective sub-regions.
# which region is most homogenous, i.e.
# has lowest coefficient of variance?
(region_stats['std']/region_stats['mean']).sort_values().round(2)
# mean CO2 by year & region
q = 'year >= 1950'
region_year = pd.DataFrame(co2_df.query(q).groupby(['year', 'region']).mean())[val]
region_year.unstack().plot();
plt.ylabel('%s metric tonnes' % val)
plt.xlabel('')
plt.suptitle("Peak mean emissions were between 1970 & 1980")
plt.legend(loc='center left', bbox_to_anchor=(1,0.5));
# median CO2 by year & region
region_year = pd.DataFrame(co2_df.query(q).groupby(['year', 'region']).median())[val]
region_year.unstack().plot()
plt.ylabel('%s metric tonnes' % val)
plt.xlabel('')
plt.suptitle("Medians paint a clearer picture of CO2 emissions", fontsize='medium', linespacing=10)
plt.legend(loc='center left', bbox_to_anchor=(1,0.5));
Europe remains the CO2 emissions gorilla.
After steady growth between 1950 to 1980, European emissions wobbled in the 80s, then fell off a cliff around 1990. Emissions plateaued (now at late 60's levels) until around 2005, whereupon they steadily decreased.
By contrast, Asia, the Americas, and to a lesser extent Oceania have seen median CO2 emissions creep steadily upwards for the entire timeframe. You can almost imagine Asia & Amercia's curves converging with Europe's in a decade or so.
At the bottom of the medians plot, African emissions initially appear to be virtually flat, but they are increasing too. But Africa's baseline is so much lower in the first place that its CO2 growth is dwarfed by that of other regions.
# show median Africa on its own
region_year.unstack()['Africa'].plot()
plt.title("Viewed alone, we see Africa's median CO2 is now 4x higher");
Here we'll quantify the percentage change in median CO2 emssions for time period q.
# create df of start & end year medians
start = region_year.unstack().index.min()
end = region_year.unstack().index.max()
region_year.unstack().loc[[start, end]]
# show % change of median value since beginning of time frame
change = region_year.unstack().loc[[start, end], ].pct_change()
change.sort_values(2013, axis=1)\
.multiply(100).round()\
.dropna(how='all')
Wow. Usually plots are more eye-opening than numbers, but in this case it's the reverse.
Why focus on Europe?
CO2_pc)# create europe CO2 df
q = "region=='Europe' & year >= 1950"
co2_eur_df = co2_df.query(q).loc[co2_df['region']=='Europe'].copy().reset_index(drop=True)
# assign name attribute for use in plot titles
co2_eur_df.name = 'Europe CO2'
# create df of sub-region medians
q = "year >= 1950"
val = 'CO2_pc'
eur_medians = co2_eur_df.query(q).groupby(['year','region', 'sub-region'])[val].median().unstack()
eur_medians.reset_index('region').plot()
# plot
plt.suptitle('Median CO2 emissions per capita (metric tonnes)')
plt.title(str.title(q), fontsize='x-small')
plt.legend(loc='center left', bbox_to_anchor=(1,0.5));
Convergence! Since 1950, median CO2_pc in Western Europe has fallen enough while Southern Europe's has increased so that the gap has been cut in more than half.
# diff by sub-region for start & end medians
start = co2_eur_df['year'].min()
end = co2_eur_df['year'].max()
start_end = eur_medians.loc[[start, end]]
start_end
# % change by sub-region
start_end.pct_change().multiply(100).round(2).drop(start)
# change in the gap between top & bottom sub-regions
start_end.max(axis=1).subtract(start_end.min(axis=1))
While data cleaning and looking at outliers, I noticed the splitting off of countries after the fall of the Soviet Union creates potential duplicate data. As European CO2_pc values tend to be much higher than anywhere else, duplicate values here can distort results.
# verify if these countries share time periods
germs = ['East Germany', 'West Germany', 'Germany']
sov = ['Russia', 'USSR']
czech = ['Czechoslovakia', 'Czech Republic', 'Slovak Republic']
serb = ['Serbia', 'Serbia And Montenegro', 'Montenegro']
all_dupes = [germs, sov, czech, serb]
all_dupes = [[word.lower() for word in lst] for lst in all_dupes]
# compare on time series plot
pd.Series(all_dupes).apply(
lambda l: co2_df.loc[co2_df['country'].str.lower().isin(l)]
.pivot('year', 'country', val)
.plot()
);
Indeed, the rows overlap. Let's drop the Soviet-era countries.
How much difference will this make?
# Remove countries that no longer exist.
# Their data is already captured by the new country name.
drop_dupes = ['East Germany', 'West Germany', 'Ussr', 'Czechoslovakia', 'Serbia And Montenegro']
drop_index = co2_df.query(q).loc[co2_df.query(q)['country'].isin(drop_dupes)].index
q = "region == 'Europe' & year >= 1950"
# before, with overlapping countries
co2_eur_orig = co2_eur_df.query(q).groupby('sub-region').describe()[val]
# after dropping the redundant countriex
co2_eur_dedupe = co2_df.query(q).drop(drop_index).query(q).groupby('sub-region').describe()[val]
# what's the difference?
co2_eur_dedupe - co2_eur_orig
# fix this in original df
co2_df = co2_df.drop(drop_index)
## helper functions
def plotly_line2(df, val):
name = df.name
df = df.pivot('year', 'country', val)
fig = df.iplot(asFigure = True,\
title = " ".join([val,name]),\
yTitle = str.title('%s in metric tonnes' % val),\
theme = 'ggplot')
return fig
def plotly_timeseries(df, value):
fig = plotly_line2(df, value)
py.offline.iplot(fig)
def plot_facet_timeseries(df, facet, val):
#df_list = facetize_df(df, facet)
#pd.Series(df_list).apply(plotly_timeseries, args=[val])
for f in df[facet].unique():
facet_df = df.loc[df[facet]==f]
facet_df.name = f
plotly_timeseries(facet_df, val)
# look at timeseries once more with de-duped data
q = "year >= 1950 & region == 'Europe'"
val = 'CO2_pc'
# create spaghetti plots for each sub-region
for reg in co2_df.query(q)['region'].unique():
reg_df = co2_df.query(q).loc[co2_df['region']==reg]
plot_facet_timeseries(reg_df, 'sub-region', val)
Virtually all countries have been trending down the last few years, except Gibraltar, Norway and Estonia.
The CO2_pc lines for countries in both E. Europe & W. Europe (after filtering out Luxembourg) are virtually parallel. I suspect this is an indicator of the interconnectedness of the economies of each of these sub-regions, and consequently that of their energy consumption.
The trajectories in S. Europe were noisier, and downright erratic in N. Europe. Similarly, I believe this is due to these regions being more heterogeneous collections of countries, with less interdependent economies.
Once more, it's the tiny countries that have the highest emissions, at least at the sub-region level. Examples are Luxembourg (a regional outlier), Faroe Islands & Estonia, and Gibraltar.
Sweden's CO2_pc has decreased so much since the late 70s that it is almost back to its 1950 level.
What prompted the CO2 declines of recent years?
Two things come to mind:
The ranges are so much larger in other regions, we'll need to use log scale.
# compare with other regions
# look at timeseries once more with de-duped data
q = "year >= 1950 & region != 'Europe'"
val = 'CO2_pc'
# convert to log scale
log_co2_df = co2_df.copy()
log_co2_df[val] = log_co2_df[val].apply(np.log)
log_co2_df.rename(columns={'CO2_pc': 'log_CO2_pc'}, inplace=True)
val = 'log_CO2_pc' # used for plot title
for reg in log_co2_df.query(q)['region'].unique():
reg_df = log_co2_df.query(q).loc[log_co2_df['region']==reg]
plot_facet_timeseries(reg_df, 'sub-region', val)
In general, less cohesive than Europe. Americas does have some of the parallel trendlines seen in E & W Europe. Again, reflection of close economic ties within sub-region, and perhaps with Europe itself?
An exception in the Amercias is the chaotic Caribbean, which perhaps not coincidentally is amost exclusively of the type of tiny countries that often have outliers.
The key different of course between any of these sub-regions and Europe's is the scale: Europe was linear, these are log.
Here I will look at 8 additional variables. Per capita energy consumption, production, and income (GDP) I expect will have strong correlations, but am curious how that may vary by country or region.
The remaining variables I chose to simply get a broader characteristics of a country, such as the level of inequality or how far up they are on the 'Human Development Index'.
All data was downloaded from the Gapminder website.
energy-use_pc (kWh): Energy use refers to use of primary energy before transformation to other end-use fuels, which is equal to indigenous production plus imports and stock changes, minus exports and fuels supplied to ships and aircraft engaged in international transport. From Electricity use, per person.energy-production_pc (tonnes of oil equivalent) From Energy production, per person (toe). hdi: A composite index measuring average ability to lead a long and healthy life,
measured by life expectancy at birth; the ability to acquire
knowledge, measured by mean years of schooling
and expected years of schooling; and the ability to
achieve a decent standard of living, measured by gross
national income per capita. See Technical note 1 at http://hdr.undp.org/. From HDI (Human Development Index)income_pc: GDP per capita measures the value of everything produced in a country during a year, divided by the number of people. The unit is in international dollars, fixed 2011 prices. The data is adjusted for inflation and differences in the cost of living between countries, so-called PPP dollars. From Income per person (GDP/per capita, PPPdollars inflation-adjusted)population: We use UN population data between 1950 to 2100 from UN Population Division World Population Prospects 2017 and the forecast to year 2100 is using their medium fertility variant. For years before 1950, this version uses the data documented in greater detail by Mattias Lindgren in version 3 below. The main source there was Angus Maddison’s data which is maintained and improved by CLIO Infra Project. From Population, total.gini_idx: A measure from 0 to 100, where 0 represents perfect equality. On Gapminder: GINI Index.military_%gdp From Military expenditure (% of GDP)gas_price_liter From Pump price for gasoline (US$ per liter)Note about data wrangling: the csv file used here is the result of merging multiple data sources. It required an INSANE amount of data wrangling. So much so that to include it all here would disrupt the flow of the analysis. Instead, I've documented the process in another notebook, make_df_nov18.ipynb.
# expand DF to include all columns
df = pd.read_csv(mypath + 'final_df.csv')
df.info()
# drop duplicate countries as per Zoom-in on Europe section
drop_index = df.loc[df['country'].isin(drop_dupes)].index
df = df.drop(drop_index)
# rename energy columns
df.rename(columns={'energy_use_pc':'energy-use_pc', 'energy_production_pc':'energy-prod_pc'}, inplace=True)
df.info()
There's a lot of null values in these new columns
# plot with log y-axis
q = "year >= 1950"
df.query(q).drop(['year'], axis=1)\
.hist(figsize=(10,10), log=True, bins=20);
# stats
df.query(q).describe().round(2)
There also seems to be a lot of zeros.
# Which columns contain 0 values?
nonzero = df.set_index(['country', 'year'])!=0.
df.set_index(['country', 'year']).mask(nonzero).count()
There seems to be a lot of zero values here that should actually be NaNs. For example, I truly doubt the price of gas anywhere, at anytime, was 0. Nor can I believe that war-torn Afghanistan spent 0% of GDP on the military during the 60s,70s, and 80s!
Similarly, the gini inequality index has a scale of 0-100, where 0 = perfect equality. While feasible, it seems extremely unlikely that so many countries achieved such a Utopian state?
I'm going to assume these 0s should be NaN.
# replace 0.0 vals with NaNs
df.replace(0., np.nan, inplace = True)
# verify
df.set_index(['country', 'year']).mask(nonzero).count()
# we should see even fewer non-null values now
df.info()
Now it's time to fill some NaNs. I'd like to use interpolation, filling in NaNs that fall between two valid values.
For this I'll create a new dataframe, with data from 1950 onwards.
# sanity check: plot medians before interpolation
index = ['region', 'sub-region', 'country']
q = "year >= 1950"
df.query(q).groupby('year').median().plot(subplots=True, figsize = (4,10));
More or less. The sharp drop for the energy columns is understandable given happened in the early 70s, when there was the OPEC oil crisis.
And what's going on with gas price?
# how many values are there for gas price?
q = "year >= 1950"
df.query(q).groupby('year')['gas_price_liter'].count().plot(kind='bar');
# and are there some really low values that were hidden from the earlier plot?
df.query(q).groupby('year')['gas_price_liter'].median().plot(kind='bar');
Well, there is data there. I'm puzzled why the gas subplot was so truncated, then. Perhaps the data is just too sparse.
# interpolate missing data when it's between 2 valid data points.
# use .transform() to interpolate country by country!
q = "year >= 1950"
df_imputed = df.set_index('country').query(q).groupby('country').transform(
lambda x: x.interpolate(method='linear',
limit_area='inside')).reset_index()
# sanity check: do the medians look realistic?
df_imputed.groupby('year').median().plot(subplots=True, figsize = (4,10));
The first time I used interpolate, I forgot that the data was in long format. This means that it would interpolate the entire column, rather than for each country within the column.
I realized my mistake when I looked created the same plot above, and saw that gini_idx and gas_price_liter suddenly had data starting from 1950, and the medians for all columns had been suspiciously smoothed-out.
# sanity check: Canada before
q = "year >=1980" # first gini val for Canada was in 1981
df.query(q).loc[df.query(q)['country']=='Canada'].head()
# sanity check: Canada after (1980 gini should still be NaN)
q = "year >=1980"
df_imputed.query(q).loc[df_imputed.query(q)['country']=='Canada'].head()
df_imputed.info()
# now let's run the stats again
q = "year >= 1950"
df_imputed.query(q).drop('year', axis=1).describe().round(2)
# look at histograms again
df_imputed.drop(['year'], axis=1)\
.hist(figsize=(10,10), bins=20);
# have correlations changed?
val = 'CO2_pc'
# BEFORE
corr_df[val].sort_values()
# AFTER
corr_df_imp = df_imputed.query(q).drop(['year'], axis=1).corr().round(2)
corr_df_imp[val].sort_values()
Hmmm. Not really.
All the columns except gini_idx, gas_price_liter and hdi are strongly negatively skewed.
Let's apply a log transform to these skewed columns.
transform_cols = [u'CO2_pc', u'energy-use_pc', u'energy-prod_pc', u'income_pc',
u'military_%gdp', u'population', ]
log_df_imputed = df_imputed.copy()
log_df_imputed[transform_cols] = log_df_imputed[transform_cols].apply(np.log)
# rename columns
new_names = [(i,'log_'+i) for i in log_df_imputed[transform_cols].columns.values]
log_df_imputed.rename(columns = dict(new_names), inplace=True)
log_df_imputed.head()
# let's revisit the histograms once more
log_df_imputed.drop(['year'], axis=1)\
.hist(figsize=(10,10), bins=10);
The columns look at least closer to normal.
# have the correlations changed this time?
val = 'log_CO2_pc'
q = "year >= 1950"
corr_log_impute = log_df_imputed.query(q).drop('year', axis=1).corr().round(2)
corr_log_impute[val].sort_values()
Yes, indeed they've changed: hdi has increased, while the log_energy- variables are now weaker. log_military_%gdp is stronger but the correlation remains only weakly positive.
How do they correlate with each other?
# seaborn pairplot
plot_kws={'line_kws':{'color':'red'}, 'scatter_kws': {'alpha': 0.1}}
sns.pairplot(log_df_imputed.query(q).drop(['year'], axis=1),
kind='reg', diag_kind='kde', plot_kws = plot_kws);
The most interesting thing here now is just how clear a relation there is between log_CO2_pc and hdi increase. While it's close as well with log_income_pc, hdi's is more symmetrical.
Aside from no real relevations beyond what we learned from looking at the correlation matrix.
Will faceting by region reveal anything interesting?
# first, plot only vars with high correlations
# with log_CO2_pc
val = 'log_CO2_pc'
q = "year >= 1990"
log_pp_df = log_df_imputed.query(q).set_index('region')\
.drop('year', axis=1)\
.reset_index()
# show CO2 correlations by region
log_pp_df.groupby('region').corr()[val].unstack(level=1).drop(val, axis=1)
Europe's usually has the lowest correlation, while Oceania often has the highest, and these are extremely high, > .9. That's a bit suspect.
# pairplot faceted by region
plot_kws={'scatter_kws': {'alpha': 0.2}} # set style
sns.pairplot(log_pp_df, hue='region', diag_kind='kde', kind='reg', plot_kws = plot_kws);
Breaking down by region, the main differences are in:
Europe, where the data tends to differ from global trends. This is unsurprising given the continent's historical advantage in trade and industrialization, and consequently standard of living.
Oceania. However, I don't entirely trust Oceania's correlations because:
CO2_pc, even with a log transformation, is not normal.Americas, which like Oceania I wonder if regional correlations are overly influenced by the US.
Ultimately, comparing regions is like comparing apples to oranges
With the exception of Europe, there is so much variation in the distribution of countries within regions and sub-regions that it's hard to draw broad conclusions.
Every region has a 'gorilla' country (such as the United States, Russia and China), with local or very high values across multiple variables, The best example may be Australia's outsize influence resulting in extremely high correlations for Oceania.
More interesting would be to find other groups for comparison, and use the regions as a secondary categorical variable.
Almost all the variables we've been looking at are per capita. It's all too easy to get a false sense of improvement when you see the slowdown in the CO2_pc growth in recent years, at least among the industrial powerhouses.
It's time to step back and look once more at the overall picture. We can get a basic idea of carbon footprint simply by multiplying population by CO2_pc.
NOTE: While I refer to this calculation as CO2_footprint for this analysis, a true measurement of carbon footprint includes other emissions besides CO2.
So who has the biggest carbon footprint these days? China? The US? Let's see.
To prevent errors in the bubble chart, I need to convert population to integer. However, to do that I first need to convert any NaNs to 0.
# convert population to int
# to enable bubble plots later on
df_imputed['population'] = df_imputed['population'].fillna(0).astype(int)
# now create footprint column
df_imputed['CO2footprint'] = df_imputed['CO2_pc'].multiply(df_imputed['population'])
# What does the global trend line look like?
kind = 'WW annual'
val = 'CO2footprint'
unit = '(10 billion metric tonnes)'
annual_totals = df_imputed.groupby('year').sum()[val]
annual_totals.plot()
plt.title(" ".join([kind, val, unit]));
# regional timeseries
kind = 'Regions annual'
val = 'CO2footprint'
unit = '(10 billion metric tonnes)'
annual_totals = df_imputed.groupby(['region','year']).sum()[val]
annual_totals.unstack(level=0).loc[1950:].plot()
plt.suptitle(" ".join([kind, val, unit]));
1990 was an inflection point for both Asia and Europe's carbon footprints, where the former began to exceed the latter. Shortly afterwards, Asia's footprint began skyrocketing, while Europe's began a gradual downward trajectory.
How have footprints changed evolved during this time?
# what did this look like a few decades earlier?
yr = 1990
q = "year==@yr"
val = 'CO2_pc'
size = 'CO2footprint'
x = 'population'
df_imputed.query(q).iplot(kind='bubble', x=x, y=val, size=size, text='country',\
xTitle= x + " (log scale)", yTitle=val, categories='region', logx=True,\
title='Who had the biggest %s in %s?' % (size, yr))
# Who had the biggest footprint in 1990?
df_imputed[['country','CO2_pc', 'population', 'CO2footprint']].loc[df_imputed.query(q)['CO2footprint'].idxmax()]
# show in bubble plot for most recent year
max_yr = df_imputed['year'].max()
yr = max_yr
q = "year==@max_yr"
df_imputed.query(q).iplot(kind='bubble', x=x, y=val, size=size, text= 'country',\
xTitle= x + " (log scale)", yTitle=val, categories='region', logx=True,\
title='Who had the biggest %s in %s?' % (size, yr))
# who has biggest footprint now?
df_imputed[['country','CO2_pc', 'population', 'CO2footprint']].loc[df_imputed.query(q)['CO2footprint'].idxmax()]
Between 1990 & 2013, which countries saw their footprint grow, and which saw it shrink?
# change from 1990 to 2013 by country
q = "year == 1990 | year == 2013"
val = 'CO2footprint'
footprint_df = df_imputed.query(q).set_index(['year', 'country']).drop(['region', 'sub-region'], axis=1)[val]
footprint_df = footprint_df.unstack()
footprint_df
# Helper functions
def topN_change(df, row, chg, n=5):
'''Calculates difference between specified row &
previous row, returning highest N results'''
if chg == 'diff':
return df.diff().round(2).loc[row]\
.dropna().sort_values(ascending=False).head(n)
elif chg == 'pct':
return df.pct_change().multiply(100).round(2).loc[row]\
.dropna().sort_values(ascending=False)\
.head(n)
else:
"Unrecognized function type for 'chg'. Specify 'diff' or 'pct'."
def bottomN_change(df, row, chg, n=5):
'''Calculates difference between specified row &
previous row, returning lowest N results'''
if chg == 'diff':
return df.diff().round(2).loc[row]\
.dropna().sort_values().head(n)
elif chg == 'pct':
return df.pct_change().multiply(100).round(2).loc[row]\
.dropna().sort_values()\
.head(n)
else:
"Unrecognized function type for 'chg'. Specify 'diff' or 'pct'."
# which countries had the biggest % incr
top5 = topN_change(footprint_df, 2013, 'pct')
top5
These increases are insane.
# sanity check
q = "year >= 1980 & country in @top5.index"
df_imputed.query(q).set_index(['country', 'year'])['CO2footprint'].unstack(level=0)\
.plot(subplots=True, figsize=(4,8));
# why is Somalia is still missing data?
q = "country == 'Somalia' & year >= 1995 & year <=2001"
df_imputed.query(q)
Ah, Somalia is actually missing rows for those years. This is odd. But since we are concerned with only 1990 & 2013 here, I'll leave this as-is for now.
# Equatorial Guinea?
q = "country == 'Equatorial Guinea' & year >= 1985"
df_imputed.query(q).set_index('year')['CO2_pc'].plot()
Seems oil reserves were discovered in 1995. https://en.wikipedia.org/wiki/Energy_in_Equatorial_Guinea#Oil
# What about Namibia's strange spike?
q = "country == 'Namibia' & year >= 1988 & year <=1995"
df_imputed.query(q)
Ah hah. Namibia's data for both population and CO2_pc only starts in 1990. While the curve for the former looks reasonable, the latter's initial steep rise followed by a long plateau is suspect. It will be better to set this to NaN, so that Namibia is not included in the '1990-2013 % changes' data frame later on in this analysis.
Also, Equitorial Guinea and Somalia have very erratic CO2_pc curves. However, I saw the same fluctuating lines when I looked online:
# any other countries with data only starting in 1990?
df.groupby('country')['year','CO2_pc'].min().query("year == 1990")
# another sanity check
q = "year >= 1980 & country == 'Marshall Islands'"
vals = ['CO2_pc', 'population']
df_imputed.query(q).set_index(['country', 'year'])[vals].unstack(level=0)\
.plot(subplots=True);
These curves look more reasonable. Just need to fix Namibia and update footprint_df.
This analysis is focused by CO2_pc, and to create the dataset this column was the "driving" column: there should be no null CO2_pc values here. Since Namibia's is suspect for 1990, let's drop the entire row.
# get location
idx = df_imputed.loc[df_imputed['country']=='Namibia'].loc[df_imputed['year']==1990].index
df_imputed.drop(idx, inplace=True)
# verify
df_imputed.loc[df_imputed['country']=='Namibia'].loc[df_imputed['year']==1990]
# recreate footprint df
q = "year == 1990 | year == 2013"
val = 'CO2footprint'
footprint_df = df_imputed.query(q).set_index(['year', 'country']).drop(['region', 'sub-region'], axis=1)[val]
footprint_df = footprint_df.unstack()
# biggest % increase
topN_change(footprint_df, 2013, 'pct')
# biggest % decrease?
bottomN_change(footprint_df, 2013, 'pct')
# The biggest absolute incr?
topN_change(footprint_df, 2013, 'diff')
# what % is the 2013 footprint for top5_incr
# compared to world total?
q = "year == 2013"
val = 'CO2footprint'
ww_footprint = df_imputed.query(q)[val].sum()
ww_footprint.round(0)
# top5 totals for final year in df
q = "year == 2013 & country in @top5_incr"
footprints = df_imputed.query(q)[['country', val]]
pct_total = footprints[['country', val]][val]\
.divide(ww_footprint).multiply(100)\
.rename('% of Total')
pd.concat([footprints, pct_total], axis=1).sort_values('% of Total')
# can we visualize this?
q = "year == 2013"
bigfoot_bool = df_imputed.query(q)['country']\
.apply(lambda x: x if x in top5_incr else 'Rest of world')\
.rename('label')
footprint_by_label = pd.concat([df_imputed.query(q)[['country', 'CO2footprint']], bigfoot_bool], axis=1)\
.groupby('label').sum()
footprint_by_label[val].sort_values().plot(kind='pie', colormap='YlOrRd')
plt.suptitle("Footprint of countries in Top 5 biggest incr is > 50% WW footprint")
plt.title(q, fontsize='x-small');
# and the biggest absolute decr?
bottomN_change(footprint_df, 2013, 'diff')
# How many countries saw in incr in CO2 footprint between 1990 & 2013?
footprint_chg = footprint_df.pct_change().multiply(100).loc[2013]
footprint_chg.where(footprint_chg > 0).count()
# how many saw a decrease?
footprint_chg.where(footprint_chg <= 0).count()
This is depressing.
But are there distinctive patterns of top-growing footprint countries vs the top-shrinking ones?
Are there any discernible differences between these two groups?
# compare incr vs decr footprint countries
countries_incr = footprint_chg.where(footprint_chg > 0).dropna().index.values
countries_decr = footprint_chg.where(footprint_chg <= 0).dropna().index.values
# add label col
trend_bool = create_member_labs(df_imputed['country'],
group = countries_incr,
true_lab = 'incr',
false_lab = 'decr')
# append to df
df_compare = append_col(df_imputed, trend_bool)
# compare means for incr vs decr
q = "year >= 2013"
df_compare.query(q).drop('year', axis=1).groupby('incr_or_decr').median().round(2)
In a nutshell, decr is made up of countries with a better standard of living. The only areas where it has the lower mean is for inequality (gini_idx) and military_%gdp.
To compare these, better to use log values for the negative-skewed variables.
# create a log-value comparison df
log_df_compare = append_col(log_df_imputed, trend_bool)
# correlations
q = "year >= 1990"
val = 'log_CO2_pc'
log_df_compare.query(q).groupby('incr_or_decr').corr()[val].unstack(level=0)
# pairplots for same time period
plot_kws={'scatter_kws': {'alpha': 0.2}} # set style
palette=dict(incr='#335fa1', decr='#9b0044');
sns.pairplot(log_df_compare.query(q).drop('year', axis=1),
hue='incr_or_decr', diag_kind='kde', kind='reg', plot_kws = plot_kws, palette=palette);
The cleanest correlations for both groups seems to be hdi (and log_income_pc, which is a major component of hdi score).
However what's really interesting when looking at faceted pairplots is seeing where trends diverge or clusters form.
There are both, at least for the decr group for gini_idx, log_military_%gdp, and gas_price_liter. However these are also in noisy scatterplots for columns with a lot of imputed values.
So far we have been comparing annual values for both groups. Now I'd like to know how has each of these variables changed between 1990 & 2013, and whether it differs between the incr and decr groups.
Also, how does this vary from country to country within each group?
The best way I know to answer these questions is to make a heatmap of percentage changes.
## Helper function
def create_member_labs(series, group, true_lab, false_lab):
member_labs = series.apply(lambda x: true_lab if x in group else false_lab)\
.rename(true_lab + "_or_" + false_lab)
return member_labs
def append_col(df, col):
new_df =pd.concat([df, col], axis=1)
return new_df
# calculate % change between start & end years
# for all numeric columns
start = 1990
end = 2013
q = "year==@start | year ==@end"
# calc % change for each col, by country
changes_df = df_imputed.set_index(['country','year']).query(q).groupby(['country'])\
.transform(lambda x: x.pct_change().multiply(100).round(2))\
.dropna(how='all').reset_index()\
.drop('year', axis=1)
changes_df.head()
# create category column to label if country is in incr & decr group
trend_bool = create_member_labs(changes_df['country'],
group = countries_incr,
true_lab = 'incr',
false_lab = 'decr')
# append column to df
changes_df = append_col(changes_df, trend_bool)
changes_df.head()
Some columns have very large ranges, so we'll need to normalize the dataframe. And since looking at both negative and postive change, we'll be normalizing around 0, with a range of (-1,1).
# why normalization is needed
changes_df.groupby('incr_or_decr').agg(['min', 'max'])
## helper functions
def norm_pos_neg(df):
'''normalize positive & negative values
between (1,-1)'''
str_df = df.select_dtypes(exclude='number')
num_df = df.select_dtypes(include='number')
mins = abs(num_df.min())
maxes = num_df.max()
neg = num_df.mask(num_df > 0).div(mins) # create df of normalized neg've values
pos = num_df.mask(num_df < 0).div(maxes) # create df of normalized pos'iti've values
combo_df = pos.fillna(neg)
return append_col(combo_df, str_df)
def sort_cols_by_corr(df, val):
'''returns columns in order from highest to lowest
correlation with val, for sorting index (y-axis) of heatmap.'''
sort_order = df.corr()[val].sort_values().index.values
return sort_order
def shorten_colnames(df):
'''shortens column names to first word,
assuming underscore (_) separates words.'''
cols = df.columns
newcols = pd.Series(cols).apply(str.split, args=('_')).apply(lambda x: x[0])
newcols_dict = pd.Series(index=cols, data=newcols.values)\
.to_dict()
return newcols_dict
def plotly_heatmap(df, title, val, colors = 'Spectral'):
'''select numeric cols from df, sort x-axis by val,
& shorten names for legibility then plot.'''
df = df.select_dtypes('number')
# get new names & sort order for cols
sort_order = sort_cols_by_corr(df, val)
shortnames = shorten_colnames(df)
# make df more legible for plot
df = df.reindex(sort_order, axis=1)\
.sort_values(val)\
.dropna(how='all', axis=1)\
.rename(columns=shortnames)
df.iplot(
kind='heatmap',
colorscale=colors,
title=title
)
# plot heatmap with normalized values
val = 'CO2footprint'
title = "Pct changes (normalized) %d-%d, sort by %s" % (start, end, val)
hm_df = norm_pos_neg(changes_df) # normalize each column around 0.
plotly_heatmap(hm_df.set_index('country'), title, val)
The most extreme increases & decreases in CO2footprint are driven more from spikes & falls in CO2_pc than population.
All countries with decreasing CO2footprint had decreasing CO2_pc. The reverse is not true.
There is an interesting divergence between income_pc & hdi, even though the latter is partly derived from the former. While income_pc's correlation with CO2_pc seems driven by extremes at each end, hdi's correlation is a steady continuum. This echoes what we saw earlier in the log_CO2_pc vs hdi pairplot.
hdi seems to be growing more slowly for decreasing and borderline-increasing countries. The same applies to population.
Changes in CO2_pc generally correlate more with changes in energy-use_pc than population.
Lastly, for decreasing-footprint countries side, the only extreme values are negative (-1), and only for the top 4 columns (rows in the heatmap). Otherwise the colours are slightly more muted, i.e. mid-range.
# normalized median %changes (normalized)
hm_df.groupby('incr_or_decr').median().round(2)
# actual median %changes
changes_df.groupby('incr_or_decr').median().round(2)
The biggest divergences are in CO2_pc, energy-use_pc and gini_idx. In general, absolute % changes were usually higher in the increasing-footprint countries, and they were positive except for military spending.
#What is the regional breakdown
# for increasing-footprint group?
q = 'country in @countries_incr & year >= 1990'
df_imputed.query(q).groupby(['region']).size().plot(kind='pie');
df_imputed.query(q).groupby(['region'])['country'].nunique()
# which European countries STILL have increasing
# co2 footprints?!
q = "country in @countries_incr & region == 'Europe' & year >= 1990"
df_imputed.query(q).groupby(['region', 'country']).size()
# Regional breakdown for decreasing-footprint group
q = 'country in @countries_decr & year >= 1990'
df_imputed.query(q).groupby(['region']).size().plot(kind='pie');
df_imputed.query(q).groupby(['region'])['country'].nunique()
# show countries with descreasing footprint that are
# actually NOT in Europe
q = "country in @countries_decr & region != 'Europe' & year >= 1990"
df_imputed.query(q).groupby(['region', 'country']).size()
The good news is there are 29 countries who have managed to decrease their CO2 footprint between 1990 & 2013. The vast majority of them are in Europe.
Looking at the heatmap and comparing their median values, these decreasing-footprint countries seem less volatile, in that their rates of change for non-energy or income variables are more moderate than the increasing group, both positive and negative.
While income, population and HDI increased for both groups, it was at a far slower pace for the decreasing-footprint countries. This leads to not just to the correlation or causation question, but also the question of which came first?
Some might want to claim the efforts to reduce CO2 created economic stagnation, resulting in slower growth for HDI, income, and population.
Given most of the reductions are occuring in Europe, my hypothesis is the region already has achieved a high standard of living and security, so these countries have the wherewithal to more aggressively tackle the seemingly more "abstract" problem of CO2 emissions.
What else I've learned along the way:
Despite big progress in decreasing CO2_pc, Europe still leads the world...for now. Asia and the Americas are rapidly catching up.
Percentage-wise, CO2 emissions in Africa are growing the fastest but this is virtually invisible when compared to other regions.
China's CO2 emission increases were nearly vertical in the 2000s, so with its population it has overtaken the US in terms of CO2 footprint.
Tiny countries tend to have gigantic CO2_pc.
Ideas for further exploration:
dig deeper in the clusters and diverging trendlines between increasing and decreasing footprints for the lower-correlating variables like gini_idx.
determine how much of Europe's declining CO2 rates were due to Phase 1 Kyoto targets (higher for the EU than elsewhere), and how much were due to the fallout from the 2008 financial crisis, as they both in effect from 2008 to 2012?
on a similar note, were there any countries that saw increasing income_pc or hdi while also experiencing decreasing CO2_pc?